6. K-Means Clustering

Aim

To understand and implement the K-Means Clustering algorithm for grouping unlabeled data into distinct clusters based on feature similarity, and to analyze how the choice of ‘k’ (number of clusters) and the initial centroids influence clustering results and performance.

Understand the K-Means Clustering Algorithm Before You Begin

Overview: K-Means Clustering is an unsupervised machine learning algorithm used to group data points into a predefined number of clusters (k). It works by assigning each data point to the cluster with the nearest centroid (mean of the cluster), then iteratively updating the centroids until the clusters become stable.
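
The assign-then-update loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the scikit-learn implementation: the function name kmeans_sketch, the random choice of starting centroids, and the tiny six-point dataset are all assumptions made for the example.

```python
import numpy as np


def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Illustrative K-Means loop (hypothetical helper, not a library function)."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct data points picked at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: find the nearest centroid for every point.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        # A cluster that received no points keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move, i.e. the clusters are stable.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Two well-separated groups of three points each (made-up data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans_sketch(X, k=2)
print("labels:", labels)
print("centroids:\n", centroids)
```

Which group receives label 0 depends on the random starting centroids, so only the grouping itself is meaningful, not the label numbers.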

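The main goal of K-Means is to minimize the intra-cluster distance (distance between points within the same cluster) and maximize the inter-cluster distance (distance between different clusters). Formally, it minimizes the within-cluster sum of squared distances between each point and its cluster centroid; for a fixed dataset this is equivalent to maximizing the between-cluster sum of squares. It is widely used for pattern recognition, customer segmentation, and data compression.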
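
In scikit-learn, the intra-cluster quantity that K-Means minimizes is exposed as the inertia_ attribute of a fitted KMeans model. The sketch below, which assumes three blobs and a fixed random_state purely for reproducibility, prints the inertia for several values of k and recomputes it by hand for k = 3.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed settings: 100 samples, 2 features, 3 blobs, standard deviation 0.5.
X, _ = make_blobs(n_samples=100, n_features=2, centers=3,
                  cluster_std=0.5, random_state=0)

# Inertia (within-cluster sum of squared distances) for several values of k.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_
    print(f"k={k}: inertia={model.inertia_:.2f}")

# Manual check for k=3: sum of squared distances of each point to its centroid.
model3 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
manual_inertia = np.sum((X - model3.cluster_centers_[model3.labels_]) ** 2)
print(f"manual inertia for k=3: {manual_inertia:.2f}")
```

Plotting these values against k gives the usual "elbow" curve: inertia keeps falling as k grows, so the value of k is typically chosen where the decrease levels off rather than where inertia is smallest.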

Further Understanding: K-Means Clustering

Test Your Understanding

Algorithm

Import Libraries: Import necessary libraries, including matplotlib and scikit-learn.
Generate Synthetic Data: Create synthetic data using a function like make_blobs from scikit-learn, specifying parameters such as the number of samples, features, centers, cluster standard deviation, shuffle, and random state.
Visualize Data: Plot the synthetic data points using matplotlib to visualize the dataset.
Initialize KMeans: Create an instance of the KMeans algorithm, specifying parameters such as the number of clusters, initialization method, number of initializations, maximum iterations, tolerance, and random state.
Fit and Predict: Use the fit_predict method of KMeans to fit the model to the data and predict cluster labels for each data point.
Visualize Clusters: Plot the data points for each cluster separately, coloring them differently for better visualization. Also, plot the centroids of each cluster.
Display Plot: Show the plot with the data points, cluster points, and centroids using matplotlib.
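
The steps above could be put together roughly as follows. This is a sketch under stated assumptions: the dataset settings (100 samples, 2 features, standard deviation 0.5) come from the Dataset Information table and the three centers from the Result section, while the KMeans arguments (init="random", n_init=10, max_iter=300, tol=1e-4), the random_state values, and the colors and markers are illustrative choices rather than prescribed values.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate synthetic data (random_state=0 is an assumed seed for repeatability).
X, y_true = make_blobs(n_samples=100, n_features=2, centers=3,
                       cluster_std=0.5, shuffle=True, random_state=0)

# Visualize the raw, unlabeled data.
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c="white", marker="o", edgecolor="black", s=50)
plt.title("Synthetic data")
plt.show()

# Initialize KMeans (argument values below are illustrative assumptions).
km = KMeans(n_clusters=3, init="random", n_init=10,
            max_iter=300, tol=1e-4, random_state=0)

# Fit the model and predict a cluster label for every point.
y_km = km.fit_predict(X)

# Plot each cluster in its own color and marker, then the centroids.
colors = ["lightgreen", "orange", "lightblue"]
markers = ["s", "o", "v"]
plt.figure()
for cluster_id in range(3):
    mask = y_km == cluster_id
    plt.scatter(X[mask, 0], X[mask, 1], s=50, c=colors[cluster_id],
                marker=markers[cluster_id], edgecolor="black",
                label=f"cluster {cluster_id + 1}")
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], s=250,
            marker="*", c="red", edgecolor="black", label="centroids")
plt.legend(scatterpoints=1)
plt.grid()
plt.title("K-Means clusters and centroids")
plt.show()
```

Changing n_clusters or switching init between "random" and "k-means++" in this sketch is one way to observe how k and the initial centroids affect the result.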

About the Synthetic Dataset

The make_blobs() function creates blobs of points with a Gaussian distribution. You can specify the number of blobs and the number of samples to generate, along with other parameters such as the cluster standard deviation. When the blobs are well separated, they are linearly separable, so the data also suits linear classification problems. Used as a multi-class classification problem, the example in this experiment generates a 2D dataset with three blobs: each observation has two inputs and a class value of 0, 1, or 2. Running the example produces the problem's inputs and outputs, as well as a 2D plot in which the points of each group are colored differently. Because the generator is stochastic, your dataset and plot can differ from run to run unless you fix the random state.

The make_moons() function generates two interleaving half-moon shapes for binary classification. You can adjust the amount of noise in the moon shapes as well as the number of samples produced. This problem suits algorithms that can learn nonlinear class boundaries.

The make_circles() function creates a binary classification problem in which the samples are arranged in two concentric circles. As with the moons problem, you can control the amount of noise in the shapes. It is a good test problem for algorithms that can learn complex nonlinear manifolds.
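
As a quick illustration of the three generators described above, the snippet below creates one dataset with each and prints its shape and class values. The noise, factor, and random_state values are arbitrary assumptions chosen only to make the run repeatable.

```python
from sklearn.datasets import make_blobs, make_circles, make_moons

# Three Gaussian blobs in 2D (class values 0, 1, 2).
X_blobs, y_blobs = make_blobs(n_samples=100, n_features=2, centers=3,
                              random_state=1)

# Two interleaving half-moons (class values 0, 1); noise is an assumed value.
X_moons, y_moons = make_moons(n_samples=100, noise=0.1, random_state=1)

# Two concentric circles (class values 0, 1); noise and factor are assumed values.
X_circles, y_circles = make_circles(n_samples=100, noise=0.05, factor=0.5,
                                    random_state=1)

for name, X, y in [("blobs", X_blobs, y_blobs),
                   ("moons", X_moons, y_moons),
                   ("circles", X_circles, y_circles)]:
    print(name, X.shape, sorted(set(y.tolist())))
# blobs (100, 2) [0, 1, 2]
# moons (100, 2) [0, 1]
# circles (100, 2) [0, 1]
```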

Dataset Information

Number of Samples	100
Number of Features	2
Cluster Standard Deviation	0.5

Source: Dataset Link

Simulation

Interactive Simulation of K-Means Clustering Algorithm.


Pre-Lab Questions

In the context of the K-Means algorithm, what are clusters and centroids?
What do the input arguments of KMeans mean?

Post-Lab Questions

Modify the code to use the make_moons dataset and display the resulting cluster plot. Explain any improvement or degradation in performance on the new dataset.
Is K-Means clustering applicable to clusters with non-uniform sample sizes? Provide code examples.

Result

The K-Means clustering algorithm was successfully applied to a synthetic dataset generated using make_blobs. The model effectively grouped the data into three clusters, and the centroids were clearly visualized, confirming correct cluster formation.